Specs Sheet
Variable Block Database Version 1031, Revision A
Topics |
|
The Variable Block Database is a file format used to store any type of variable-length binary data in re-sizeable blocks. The VBD file format was originally created to store persistent objects in disk files. A persistent object refers to an object that saves its state between program invocations. The VBD file format is not limited to storing persistent objects. Any type of data can be stored and manipulated in a VBD file. VBD files have this capability because the methods by which any data type is stored or retrieved must be defined within the application.
Every VBD file contains a file header, an optional static storage area, and a dynamic storage area where variable blocks of data are stored. The file header is composed of six fields and is a total of 28 bytes in length:
(1) Free space field - stores a pointer to the first block of de-allocated heap space (4 bytes)
(2) End of file field - stores a pointer to the address of the last byte in the file (4 bytes)
(3) Start of heap field - stores a pointer to the start of the dynamic storage area (4 bytes)
(4) Highest block field - stores a pointer to the highest allocated block (4 bytes)
(5) Signature field - signature used to identify the file type (8 bytes)
(6) Version field - used to identify the VBD file version number(4 bytes)
This header is used to store information needed by the allocation functions, a signature, and a version number. The static data area is used to store fixed data that cannot be altered by any of the dynamic allocation routines. The size of the static data must by specified by the application that creates the VBD file. If no static area is specified then the dynamic data area will start directly after the VBD file header.
The first 28 bytes of the VBD file (file addresses 0 - 28) will always contain the VBD header. If a static area is requested by the application, the static area will occupy the bytes requested. For example: if an application request 20 bytes to be reserved for a static data area when the file is created, then bytes 29 through 48 (file addresses 29 - 48) will be reserved. The dynamic data area will start directly after that and occupy the rest of the file. A pointer to the start of the dynamic storage area can be calculated by adding the size of the VBD file header to the size of the static data area.
The dynamic data area always starts out empty and grows when variable data blocks are allocated. The "free space", "end of file", and "start of heap" pointers stored in the VBD file header are used to maintain the dynamic data area. The "start of heap" pointer stores the address where the dynamic data area starts (which is where the static area ends.) The "end of file" pointer marks the end of the file and points to the location where the file can be extended during block allocation.
Revision 'A' adds a 32-bit CRC checksum routine used to detect any bit errors that occur during data storage. The CRC is based on the Ethernet polynomial of 0x4C11DB7. A checksum is calculated when data is written to the VBD file, this includes the block header and the block data. The calculated checksum is then compared to data actually stored on disk. If the calculated checksum does not match the actual checksum, a bit error has occurred during data storage. All bit errors must be handled by the application since the type of data being stored is not known.
Revision 'A' reserves four bytes at the end of each block that can be used by an application to store a 32-bit checksum with each block. Each time a new block is allocated space is reserved for the number of bytes requested plus four additional bytes. The application is responsible for reading and writing the block checksum since the type of data being stored is not known. The use of a block checksum is optional and does not have to be used to detect bit errors because of the built-in CRC checksum routine used each time any data is stored. Some applications require the use of block checksums to maintain the integrity of the file from one program invocation to the next.
When blocks are de-allocated, the block is marked deleted and left in the file. The size of the file is determined by the number of blocks allocated and will remain the same no matter how many blocks are deleted. The number of deleted blocks is maintained in a non-contiguous list. Each block in the list points to the next block in the list starting at a specified address. The "free space" pointer in the VBD file header is used to store the file address of the block where the free space list starts.
The "highest block" pointer in the VBD file header is used to store the address of the highest allocated variable block. It is used by the application when it needs to index data starting at the end of the file.
The signature and version fields in the VBD file header must be set when a new file is created. The signature field is used to determine if the file is of the correct type. If the "VBDFILE" signature is not found then it is assumed that this is not a valid VBD file. The eighth byte is used for all revision changes. Version 1031 sets the revision letter to 'A' and performs compatibility checks to ensure backward compatibility with previous releases. Revision 'A' adds a 32-bit CRC checksum routine and reserves space at the end of each block for an application to write a 32-bit check-word.
The version field is to represent this file version number. A version number is used to indicate that changes have been made to the application used to create VBD files. Version numbers can be used inside of an application to conditionally perform certain operations based on its value. This will ensure backward compatibility with previous versions.
Variable Block (VB) headers manage all the variable data blocks created inside the dynamic data area. Block headers are used to mark the start of a variable data block. Every time a block is allocated a block header is written to the file. Allocation works by writing the block header and then reserving a specified number of bytes for the data that will be stored in the block. Revision 'A' reserves four bytes at the end of the block to allow the application to store a block checksum. After the space is allocated, the application is responsible for writing the data to the file starting at the address after the block header. Each block header contains four fields and is a total of 16 bytes in length:
(1) Check word field - "0x0000FEFE" check word used for file integrity checks (4 bytes)
(2) Length field - block length including the object, block header, and CRC (4 bytes)
(3) Status field - stores the status of dynamic data stored in the block (4 byte)
(4) Next deleted block field - stores a pointer to the next deleted block (4 bytes)
The VB header "check word" field represents a 32-bit check word used for file integrity checks and is used to maintain synchronization within the VBD creator and the application. Revision 'A' uses all 32 bits of the block check-word. In previous releases the upper 16 bits of the block check-word was reserved for a 16-bit CRC. Revision 'A' implements a 32-bit CRC routine and reserves four bytes at the end of each block for a 32-bit checksum. If the check word value is changed, the file will no longer be compatible with previous releases.
The "length" field stores the length of the object plus the size of the block header and the size of the block's CRC checksum value. This field is used to index the file block by block. By reading this value the application will always know where the next block is in sequence. The check-word is used to ensure that the next block in sequence is a valid variable block. The length of the object can be calculated by subtracting the size of the block header and the size of the CRC checksum from the "length" field.
The "status" field stores the status of dynamic data stored in this block. Only one byte of the status field is used. The remaining three bytes of the status field is reserved for future use. The status of a variable data block can be determined by one of three byte values: 'N' for normal (ASCII 78), 'D' for deleted (ASCII 68), or 'R' for removed (ASCII 82.) A block marked 'N' for normal indicates that the block is in use and cannot be reclaimed by any of the allocation routines. A block marked 'D' for deleted means that the data in the block is still valid, but the block can be overwritten if needed. A block marked 'R' for removed means that the data in the block has been removed and the block can be overwritten if needed. Marking deleted blocks 'N' for normal can easily restore them. Once a block is removed it can never be restored.
The "next deleted block" pointer in the VB header stores a pointer to the next deleted or removed variable block, only if this block has been deleted or removed. The total number of deleted blocks is maintained in a non-contiguous list. Each deleted or removed block in the list points to the next deleted or removed block in the list starting at the head of the list. The "free space" pointer in the VBD file header is used to store the file address where the head of the free space list is located. When a variable block is deleted or removed the next deleted block field is set to the VBD header's free space value to become the head of the free space list and the VBD header's "free space" pointer is set to the address of this deleted or removed block.
In order to prevent the VBD files from becoming extremely fragmented due to numerous deletions, blocks marked deleted or removed will be reused by the allocation routine. When a new variable block is allocated and the "free space" field is not empty, the allocation routine will walk through the free space list looking for a deleted or removed block of the size to be allocated.
One of two methods can be used to prevent fragmentation, the best-fit method or the first-fit method. The best-fit method works by scanning the entire free space list until the best location to reuse a block is found. Best-fit values are calculated based on a percentage of the number of bytes requested. Any blocks greater then 2.5 times bigger then the number of bytes requested will not be reused. This ensures that smaller blocks will not use small portions of very large blocks and all new blocks will use the maximum amount of space in any block that is reused. However, the best-fit method is very costly in terms of speed when the free space list is very large.
The first-fit method works by searching the free space list until the first block of the appropriate size is found. The first block large enough to hold two block headers plus the number of bytes requested with at least one byte left over will be reused. Several splits can occur if the blocks vary greatly in size. When a block is split, the unused portion of the block is assigned a new block header (marked removed) and placed back on the free space list. A small block can cause a very large block to be divided several times leaving gaps of smaller and smaller blocks.
The reclaim method used is application dependent. If all the blocks are more or less the same size the first-fit method is more efficient terms of speed. If all the blocks vary greatly in size the best-fit method is more efficient in preventing fragmentation.
VBD version 1031, revision 'A' requires that all file addresses be represented by 32-bit signed integer values. This means that the file will be allowed to grow to a maximum size of 2.1 GB. The VBD file header must occupy 28 bytes of disk space starting at file address zero. The header must be comprised of four 32-bit signed integers and an eight-byte character array followed by another 32-bit signed integer. A VBD file's static data area will only occupy as many bytes as requested. Every variable data block will be stored directly after the static data and be marked with a block header. The block header must occupy 16 bytes of disk space and be comprised of four 32-bit unsigned integers. The data following any variable block header can have as many bytes as requested, up to 2.1 GB. Four bytes will be reserved at the end of each block to store an optional 32-bit checksum with each block.
File addresses are multi-byte numbers used to point to specific locations in a disk file. The byte order in which multi-byte numbers are stored in a computer's memory is specific to each microprocessor. For example, the Intel x86 families of microprocessors store the lowest-order byte first. Hewlett Packard's PA-RISC families of microprocessors store the highest-order byte first. A file address stored in a disk file will represent different values if the file is created on one system and read on the other system.
In order to gain platform independent VBD files must use their own representation of 32 bit-signed integers for file addresses. By manipulating the file addresses in memory before they are written to disk, it is possible to represent 32 bit signed integers independently of the operating system or hardware platform used. This will overcome any byte ordering problems encountered when writing file addresses to a common database file accessed by several different types of machines.
The same scheme used to overcome the byte ordering problem encountered with 32-bit signed integers must also be applied to 32-bit unsigned integers and 16-bit integers if any of these data types are stored in the file. Also a similar method must be implemented for single and double precision floating point values if any of these types are used.
String values are represented in memory exactly as they appear. Since none of the bytes in a string are reordered in memory, they can be written directly from memory to disk regardless of the platform used to create them.